{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Appendix \n",
    "This is an appendix notebook for explanations that we skipped in the blogs by [Inveterate Learner](https://medium.com/inveterate-learner) either because it would make it too long or was a bit repetitive."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  1. Deep Learning Book: Chapter 7 - Regularization for Deep Learning ([Link to post](https://medium.com/inveterate-learner/deep-learning-book-chapter-7-regularization-for-deep-learning-937ff261875c))\n",
    "\n",
    "**Explanation for the weight update of $L^1$ regularization**\n",
    "\n",
    "$$ \\bigtriangledown_w \\tilde{J}(\\theta; X, y) = \\bigtriangledown_w J(\\theta; X, y) + \\alpha * sign(w) $$\n",
    "$$ \\bigtriangledown_w J(\\theta; X, y) = H(w - w^*)$$\n",
    "\n",
    "Now, we'll have to look at each unit and not the entire *w* vector:\n",
    "- Case 1: sign($w_i$*) > 0\n",
    "\n",
    "Equating the gradient to zero, we get:\n",
    "$$ H_{i, i}(w_i - w_i^*) + \\alpha = 0$$\n",
    "$$ \\Rightarrow w_i = w_i^* - \\frac {\\alpha}{H_{i, i}} $$\n",
    "\n",
    "Similarly,\n",
    "- Case 2: sign($w_i$*) > 0\n",
    "$$ \\Rightarrow w_i = w_i^* + \\frac {\\alpha}{H_{i, i}} $$\n",
    "$$ \\Rightarrow w_i = - (-w_i^* -\\frac {\\alpha}{H_{i, i}}) $$\n",
    "\n",
    "Therefore, overall we get the following:\n",
    "\n",
    "$$ w_i = sign(w_i*)(|w_i^*| - \\frac {\\alpha}{H_{i, i}}) $$\n",
    "\n",
    "The explanation for why `max` occurs is given in the post itself."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  2. Deep Learning Book: Chapter 8- Optimization For Training Deep Models ([Link to post (TODO)](https://medium.com/inveterate-learner/deep-learning-book-chapter-7-regularization-for-deep-learning-937ff261875c))\n",
    "\n",
    "**Standard Error of the mean estimated from N samples**\n",
    "\n",
    "Assume that each of the samples is normally distributed, i.e. each $X_i \\sim \\mathcal{N}(\\mu, \\sigma^2)$, where $\\mu$ is the mean and $\\sigma^2$ is the variance.\n",
    "Then, estimated mean, $\\hat{\\mu}$ is given by:\n",
    "\n",
    "$$ \\hat{\\mu} = \\frac{\\sum_{i=1}^{n} X_i}{n} $$\n",
    "\n",
    "Therefore, remembering that: $var(\\frac{x}{n}) = \\frac{var(x)}{n^2}$, the variance of $\\hat{\\mu}$ is given by:\n",
    "\n",
    "$$ var(\\hat{\\mu}) = \\sum_{i=1}^{n} var(\\frac{X_i}{n}) $$\n",
    "$$ \\Rightarrow var(\\hat{\\mu}) = \\sum_{i=1}^{n} \\frac{\\sigma^2}{n^2} $$\n",
    "$$ \\Rightarrow var(\\hat{\\mu}) =  n \\frac{\\sigma^2}{n^2} $$\n",
    "$$ \\Rightarrow var(\\hat{\\mu}) =  \\frac{\\sigma^2}{n} $$ \n",
    "\n",
    "Now, Standard Error (S.E.) of any variable X is given by $\\sqrt{var(X)}$. Therefore:\n",
    "\n",
    "$$ S.E.(\\hat{\\mu}) = \\frac{\\sigma}{\\sqrt{n}} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Critical point of 2nd Order Taylor Series Approximation of J($\\theta$)**\n",
    "\n",
    "$$ J'(\\theta) = J'(\\theta_0) + \\frac{d}{d\\theta} \\left[(\\theta - \\theta_0)^T \\triangledown_{\\theta}J(\\theta_0) \\right]+ \\frac{d}{d\\theta} \\left[\\frac{1}{2}  (\\theta - \\theta_0)^T H  (\\theta - \\theta_0) \\right] $$\n",
    "\n",
    "$$ J'(\\theta_0) = 0 \\hspace{.5cm} \\text{as it is a constant}$$\n",
    "\n",
    "$$ \\frac{d}{d\\theta} \\left[(\\theta - \\theta_0)^T \\triangledown_{\\theta}J(\\theta_0) \\right] = \\triangledown_{\\theta}J(\\theta_0) * \\frac{d}{d\\theta} (\\theta - \\theta_0)^T + (\\theta - \\theta_0)^T  * \\frac{d}{d\\theta} \\triangledown_{\\theta}J(\\theta_0) \\hspace{.5cm} \\text{using the u-v method of differentiation} $$\n",
    "\n",
    "$$ \\hspace{4cm} = \\triangledown_{\\theta}J(\\theta_0) + 0 \\hspace{.5cm} \\text{as } \\triangledown_{\\theta}J(\\theta_0)  \\text{ is a constant} $$\n",
    "\n",
    "$$ \\frac{d}{d\\theta} \\left[\\frac{1}{2}  (\\theta - \\theta_0)^T H  (\\theta - \\theta_0) \\right] = \\frac{1}{2} H (\\theta - \\theta_0) * \\frac{d}{d\\theta} (\\theta - \\theta_0)^T + \\frac{1}{2} (\\theta - \\theta_0)^T  * \\frac{d}{d\\theta}  H (\\theta - \\theta_0) \\hspace{.5cm} \\text{similarly} $$\n",
    "\n",
    "$$ \\hspace{4cm} = H (\\theta - \\theta_0)  \\text{ property of matrix differentiation} $$\n",
    "\n",
    "So, overall:\n",
    "\n",
    "$$ J'(\\theta) = \\triangledown_{\\theta}J(\\theta_0) +  H (\\theta - \\theta_0) $$\n",
    "\n",
    "At the critical point, $\\theta^*$, $J'(\\theta^*)$ = 0. Therefore:\n",
    "\n",
    "$$ 0 = \\triangledown_{\\theta}J(\\theta_0) + H (\\theta^* - \\theta_0) $$\n",
    "$$ \\theta^* = \\theta_0 -  H ^ {-1}\\triangledown_{\\theta}J(\\theta_0) $$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Explanation of how negative curvature results in standard gradient computation**\n",
    "\n",
    "If the eigenvalues of H are too negative, $\\alpha$ needs to be very high to compensate for that, in which case the term (H + $\\alpha$ I) is  dominated by $\\alpha$ I.\n",
    "\n",
    "$$ \\mathcal{H} + \\alpha I \\approx \\alpha I $$\n",
    "\n",
    "$$ \\Rightarrow \\theta^* \\approx \\theta_0  - [\\alpha I]^{-1} \\bigtriangledown_{\\theta} f(\\theta_0)$$\n",
    "\n",
    "$$ \\Rightarrow \\theta^* \\approx \\theta_0  - \\frac{\\bigtriangledown_{\\theta} f(\\theta_0)}{\\alpha}$$\n",
    "\n",
    "whereas, the standard gradient descent update would be given by:\n",
    "\n",
    "$$ \\Rightarrow \\theta^* \\approx \\theta_0  - \\epsilon \\bigtriangledown_{\\theta} f(\\theta_0) \\hspace{.5cm} \\text{with } \\epsilon \\text{ being the learning rate} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Explanation of how large weights cause symmetry breaking during initialization**\n",
    "\n",
    "Suppose the eigen-value decomposition of W is given by: $ W = Q V Q^{-1}$ where V is the diagonal matrix of eigen values. Now, if a noise of $\\epsilon$ is added to the input, upon doing W \\* x an extra term W * $\\epsilon$ appears at the output. This $\\epsilon$ term scales the diagonal matrix V. So, if the eigenvalues of W are $\\lambda_1$, $\\lambda_2$, etc., it becomes $\\lambda_1 \\epsilon$, $\\lambda_2 \\epsilon$, etc. Thus, if W had similar eigenvalues for all its eigen directions, i.e. $\\lambda_1 \\approx \\lambda_2$, etc., then $\\lambda_1 \\epsilon \\approx \\lambda_2 \\epsilon$, which means that using different eigen directions didn't give anything extra. However, if the eigen values differ a lot, then multiplication with $\\epsilon$ will increase that difference. This is making a much better use of different eigen directions and thus, has a symmetry breaking effect./"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
